Face-StyleSpeech: Improved Face-to-Voice latent mapping for Natural Zero-shot Speech Synthesis from a Face Image
Generating a voice from a face image is crucial for developing virtual humans
capable of interacting using their unique voices, without relying on
pre-recorded human speech. In this paper, we propose Face-StyleSpeech, a
zero-shot Text-To-Speech (TTS) synthesis model that generates natural speech
conditioned on a face image rather than reference speech. We hypothesize that
learning both speaker identity and prosody from a face image poses a
significant challenge. To address this issue, our TTS model incorporates both
a face encoder and a prosody encoder. The prosody encoder is specifically
designed to model prosodic features that cannot be captured from a face image
alone, allowing the face encoder to focus solely on capturing the speaker
identity from the face image. Experimental results demonstrate that
Face-StyleSpeech generates more natural speech from a face image than the
baselines, even for face images the model has not been trained on. Samples
are available on our demo page: https://face-stylespeech.github.io.
Comment: Submitted to ICASSP 202
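To make the two-encoder design concrete, here is a minimal PyTorch-style
sketch of how a face embedding and a prosody embedding could jointly
condition a phoneme-level TTS backbone. All module names, architectures, and
the additive conditioning are illustrative assumptions, not the authors'
implementation.

```python
# Hypothetical sketch of the two-encoder conditioning described above.
import torch
import torch.nn as nn

class FaceEncoder(nn.Module):
    """Maps a face image to a speaker-identity embedding (assumed CNN)."""
    def __init__(self, embed_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, face):                 # face: (B, 3, H, W)
        return self.backbone(face)           # (B, embed_dim)

class ProsodyEncoder(nn.Module):
    """Models prosodic variation not recoverable from the face alone."""
    def __init__(self, n_mels=80, embed_dim=256):
        super().__init__()
        self.rnn = nn.GRU(n_mels, embed_dim, batch_first=True)

    def forward(self, mel):                  # mel: (B, T, n_mels)
        _, h = self.rnn(mel)
        return h[-1]                         # (B, embed_dim)

def condition(face_emb, prosody_emb, phoneme_hidden):
    """Add the speaker and prosody embeddings to each phoneme hidden state."""
    style = face_emb + prosody_emb           # (B, D)
    return phoneme_hidden + style.unsqueeze(1)  # broadcast over (B, L, D)
```

In the zero-shot setting described above, the prosody embedding would have to
be predicted or sampled at inference time, since only a face image is
available; that step is outside this sketch.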
Grad-StyleSpeech: Any-speaker Adaptive Text-to-Speech Synthesis with Diffusion Models
There has been significant progress in Text-To-Speech (TTS) synthesis
technology in recent years, thanks to advances in neural generative modeling.
However, existing methods for any-speaker adaptive TTS achieve unsatisfactory
performance because of their suboptimal accuracy in mimicking target
speakers' styles. In this work, we present Grad-StyleSpeech, an any-speaker
adaptive TTS framework based on a diffusion model, which generates highly
natural speech that closely matches a target speaker's voice given only a
few seconds of reference speech. Grad-StyleSpeech significantly outperforms
recent speaker-adaptive TTS baselines on English benchmarks. Audio samples
are available at https://nardien.github.io/grad-stylespeech-demo.
Comment: ICASSP 202
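As a rough illustration of style-conditioned diffusion sampling, the sketch
below runs a DDPM-style reverse process that denoises a mel-spectrogram while
conditioning every step on a style embedding from reference speech. The noise
schedule, step count, and `eps_model` signature are assumptions; the paper's
score-based formulation differs in detail.

```python
# Minimal DDPM-style ancestral sampler conditioned on a style embedding.
import torch

def ddpm_sample(eps_model, style, shape, n_steps=50, device="cpu"):
    """Start from Gaussian noise and iteratively denoise a mel-spectrogram.

    eps_model(x, t, style) is a hypothetical noise-prediction network that
    receives the noisy mel, the timestep, and the speaker-style embedding.
    """
    betas = torch.linspace(1e-4, 0.02, n_steps, device=device)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape, device=device)            # x_T ~ N(0, I)
    for t in reversed(range(n_steps)):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = eps_model(x, t_batch, style)           # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise      # ancestral step
    return x                                         # generated mel
```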
KALA: Knowledge-Augmented Language Model Adaptation
Pre-trained language models (PLMs) have achieved remarkable success on
various natural language understanding tasks. Simple fine-tuning of PLMs, on
the other hand, can be suboptimal for domain-specific tasks because PLMs
cannot possibly cover knowledge from all domains. While adaptive pre-training
of PLMs can help them obtain domain-specific knowledge, it incurs a large
training cost. Moreover, adaptive pre-training can harm the PLM's performance
on the downstream task by causing catastrophic forgetting of its general
knowledge. To overcome such limitations of adaptive pre-training for PLM
adaptation, we propose a novel domain adaptation framework for PLMs, coined
Knowledge-Augmented Language model Adaptation (KALA), which modulates the
intermediate hidden representations of PLMs with domain knowledge consisting
of entities and their relational facts. We validate the performance of KALA
on question answering and named entity recognition tasks on multiple datasets
across various domains. The results show that, despite being computationally
efficient, KALA largely outperforms adaptive pre-training. Code is available
at: https://github.com/Nardien/KALA/.
Comment: NAACL 202
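"Modulates the intermediate hidden representations of PLMs with domain
knowledge" suggests a feature-wise scale-and-shift conditioned on entity
embeddings. Below is a hedged sketch of that idea; the layer placement,
entity-to-token alignment, and names are assumptions rather than KALA's exact
formulation.

```python
# Sketch: scale and shift PLM hidden states with aligned entity embeddings.
import torch
import torch.nn as nn

class KnowledgeModulation(nn.Module):
    def __init__(self, hidden_dim, entity_dim):
        super().__init__()
        self.gamma = nn.Linear(entity_dim, hidden_dim)  # knowledge -> scale
        self.beta = nn.Linear(entity_dim, hidden_dim)   # knowledge -> shift

    def forward(self, hidden, entity_emb):
        # hidden:     (B, L, hidden_dim) intermediate PLM representations
        # entity_emb: (B, L, entity_dim) embedding of the entity linked to
        #             each token (zeros where no entity is linked)
        return self.gamma(entity_emb) * hidden + self.beta(entity_emb)
```

Such a layer would sit between transformer blocks, with entity_emb produced
by an entity linker plus an embedding of each entity's relational facts from
the domain knowledge graph.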
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models
Emotional Text-To-Speech (TTS) is an important task in the development of
systems (e.g., human-like dialogue agents) that require natural and emotional
speech. Existing approaches, however, aim only to produce emotional TTS for
speakers seen during training, without considering generalization to unseen
speakers. In this paper, we propose ZET-Speech, a zero-shot adaptive
emotion-controllable TTS model that allows users to synthesize any speaker's
emotional speech using only a short, neutral speech segment and the target
emotion label. Specifically, to enable a zero-shot adaptive TTS model to
synthesize emotional speech, we propose domain adversarial learning and
guidance methods on the diffusion model. Experimental results demonstrate that
ZET-Speech successfully synthesizes natural and emotional speech with the
desired emotion for both seen and unseen speakers. Samples are at
https://ZET-Speech.github.io/ZET-Speech-Demo/.
Comment: Accepted by INTERSPEECH 202
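The abstract names two ingredients: domain adversarial learning and guidance
on the diffusion model. The sketch below shows one plausible instantiation of
each, a gradient reversal layer (the standard mechanism behind domain
adversarial training) and classifier-free-style guidance toward a target
emotion label; the eps_model signature and guidance scale are assumptions.

```python
# Two building blocks, sketched under the assumptions stated above.
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass,
    so the encoder learns features an adversarial classifier cannot use."""
    @staticmethod
    def forward(ctx, x):
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output

def guided_eps(eps_model, x, t, speaker, emotion, scale=2.0):
    """Blend emotion-conditional and unconditional noise predictions so that
    sampling is steered toward the desired emotion label."""
    eps_cond = eps_model(x, t, speaker, emotion)      # emotion-conditioned
    eps_uncond = eps_model(x, t, speaker, None)       # emotion dropped
    return eps_uncond + scale * (eps_cond - eps_uncond)
```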
Knowledge Graph-Augmented Language Models for Knowledge-Grounded Dialogue Generation
Language models have achieved impressive performance on dialogue generation
tasks. However, when generating responses for a conversation that requires
factual knowledge, they are far from perfect, due to an absence of mechanisms
to retrieve, encode, and reflect the knowledge in the generated responses. Some
knowledge-grounded dialogue generation methods tackle this problem by
leveraging facts from Knowledge Graphs (KGs); however, they do not guarantee
that the model utilizes a relevant piece of knowledge from the KG. To overcome
this limitation, we propose SUbgraph Retrieval-augmented GEneration (SURGE), a
framework for generating context-relevant and knowledge-grounded dialogues with
the KG. Specifically, our SURGE framework first retrieves the relevant subgraph
from the KG, and then enforces consistency across facts by perturbing their
word embeddings conditioned on the retrieved subgraph. Then, we utilize
contrastive learning to ensure that the generated texts have high similarity to
the retrieved subgraphs. We validate our SURGE framework on OpendialKG and
KOMODIS datasets, showing that it generates high-quality dialogues that
faithfully reflect the knowledge from the KG.
Comment: Preprint. Under review
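For the contrastive step, a standard instantiation is an InfoNCE objective
between embeddings of the generated response and the retrieved subgraph,
treating matched pairs within a batch as positives. The encoders and the
temperature below are assumptions, not necessarily SURGE's exact objective.

```python
# InfoNCE-style contrastive loss between text and subgraph embeddings.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, graph_emb, temperature=0.1):
    # text_emb, graph_emb: (B, D); row i of each forms a positive pair,
    # while all other rows in the batch serve as negatives.
    text_emb = F.normalize(text_emb, dim=-1)
    graph_emb = F.normalize(graph_emb, dim=-1)
    logits = text_emb @ graph_emb.t() / temperature   # (B, B) similarities
    labels = torch.arange(text_emb.size(0), device=text_emb.device)
    return F.cross_entropy(logits, labels)            # diagonal = positives
```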